
Mapping Big Data with Dask and Datashader

In this notebook I'd like to show how we can plot a large dataset (36 million points, in this case) on a single machine (a MacBook Air, in my case) using two new libraries: dask and datashader.

In [1]:
%matplotlib inline
import pylab as plt
In [13]:
from ipynotifyer import notifyOnComplete as nf
In [2]:
import numpy as np
import pandas as pd
In [3]:
import datashader as ds
import datashader.transfer_functions as tf
In [4]:
from dask import dataframe as dd
import dask
In [5]:
from functools import partial

from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, viridis, inferno
from IPython.core.display import HTML, display
In [6]:
from pyproj import Proj # reproject points to State Plane

nyc = Proj(init='epsg:2263')

def reproj(df, prj=nyc):
    # use the passed-in projection, not the global `nyc`
    x, y = prj(df['lon'].values, df['lat'].values)
    return df.assign(x=x, y=y)

Get the data

In [18]:
dsk = dd.read_csv('data/data*.csv', encoding='utf8')
In [8]:
len(dsk) # size of the dataset
/Users/casy/anaconda/lib/python2.7/multiprocessing/pool.py:113: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  result = (True, func(*args, **kwds))
Out[8]:
35998001
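Note that dask reads the files in partitions and only pulls each partition into memory when needed. Conceptually this is similar to pandas' chunked reading; a minimal sketch with a hypothetical in-memory CSV standing in for data/data*.csv:

```python
import io
import pandas as pd

# A tiny hypothetical stand-in for data/data*.csv.
csv = io.StringIO(
    "lon,lat,application,timestamp\n"
    "-73.99,40.73,instagram,1500000000\n"
    "-74.01,40.71,foursquare,1500000500\n"
    "-73.95,40.78,twitter for iphone,1500001000\n"
)

# Read in chunks, as dask does per partition, and count rows chunk by chunk.
total = sum(len(chunk) for chunk in pd.read_csv(csv, chunksize=2))
print(total)  # 3
```

Dask does the same across many partitions, in parallel, without ever holding the whole dataset in memory at once.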

Process the data

  • lowercase the application column (casting to categorical is postponed)
In [19]:
dsk = dsk.assign(application=dsk.application.str.lower()) #.astype('category'))
  • reproject to NYC state plane
In [20]:
dsk = dsk.map_partitions(reproj)
  • add daytime in seconds
In [21]:
dsk = dsk.assign(daytime=dsk.timestamp.mod(86400)) # 86400 seconds in a day
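The mod trick above works because UNIX timestamps count seconds since midnight UTC on the epoch, and a day has 86400 seconds, so timestamp % 86400 gives seconds since midnight (in UTC). A minimal sketch, with hypothetical timestamps:

```python
import pandas as pd

ts = pd.Series([1500000000, 1500000000 + 3600])  # two hypothetical UNIX timestamps

daytime = ts.mod(86400)  # seconds since midnight, UTC
print(daytime.tolist())  # [9600, 13200] -> 02:40:00 and 03:40:00 UTC
```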

Now let's play with dask graph visualisation, just because it is awesome. As we can see, the data is split into many chunks, and for each chunk a set of transformations is performed (all operations so far are row-wise).

In [22]:
dsk.visualize()
Out[22]:

And now let's actually compute the result.

In [23]:
d = dsk.compute()

VISUALISATION

Now let's prepare to visualise our map using datashader.

First, let's define a canvas size

In [25]:
plot_width  = int(1000)
plot_height = plot_width

background = "black"

The datashader examples propose using a partial helper, so we don't have to define the background style every time.

In [26]:
export = partial(export_image, background = background)
cm = partial(colormap_select, reverse=(background!="black"))
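functools.partial simply pins keyword arguments to a function. A minimal sketch, with a hypothetical stand-in for export_image:

```python
from functools import partial

def export_image(img, filename, background="white"):
    # Hypothetical stand-in for datashader.utils.export_image.
    return "%s.png saved on %s background" % (filename, background)

# Pin background once; every later call only needs the image and name.
export = partial(export_image, background="black")
print(export(None, "tweets3"))  # tweets3.png saved on black background
```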

Also, we need the notebook output area to be wide.

In [27]:
display(HTML(""))

Now let's define our data-side canvas coordinates. We can simply reproject them from lon/lat as well.

In [28]:
sw = nyc( -74.15, 40.463661  ) # reproj
ne = nyc( -73.66, 40.947435  ) # reproj

NYC = x_range, y_range = zip(sw, ne)

cvs = ds.Canvas(plot_width, plot_height, *NYC)
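The zip line above transposes the two corner points into per-axis ranges. A sketch with hypothetical State Plane coordinates:

```python
# Hypothetical State Plane coordinates (feet) for the two corners.
sw = (913000.0, 120000.0)   # south-west corner: (x, y)
ne = (1067000.0, 273000.0)  # north-east corner: (x, y)

# zip transposes corner points into per-axis ranges.
x_range, y_range = zip(sw, ne)
print(x_range, y_range)  # (913000.0, 1067000.0) (120000.0, 273000.0)
```

One caveat: this notebook runs Python 2.7, where zip returns a list. Under Python 3 the NYC variable would end up holding an exhausted zip iterator after the unpacking, so you would want NYC = x_range, y_range = tuple(zip(sw, ne)) there instead.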

Density

First, let's just count tweets for each point.

In [51]:
count = cvs.points(d, 'x', 'y')

Let's start with linear interpolation. It is a bad idea 99% of the time, but let's try it anyway.

In [31]:
export(tf.interpolate(count, cmap = Greys9, how='linear'),'tweets_density_linear')
Out[31]:

As we expected, it really does not help, so let's stick with histogram equalization. This means that for each color in the colormap, the buckets are adjusted so that each color represents an equal number of points.
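The idea behind eq_hist can be sketched in plain numpy with a rank-based mapping (a simplification of what datashader actually does — ties and NaNs are not handled here):

```python
import numpy as np

def eq_hist_sketch(counts, n_colors=4):
    # Rank-based sketch of histogram equalization: each of the
    # n_colors buckets receives (roughly) the same number of points.
    ranks = counts.argsort().argsort()      # rank of each value
    return ranks * n_colors // len(counts)  # rank -> color bucket

counts = np.array([1, 2, 3, 4, 100, 200, 300, 400])
print(eq_hist_sketch(counts))  # [0 0 1 1 2 2 3 3]
```

A linear mapping would lump the four small values into the lowest bucket; the rank-based mapping spreads them out, which is why eq_hist shows structure in sparse areas.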

In [32]:
export(tf.interpolate(count, cmap = Greys9, how='eq_hist'),'tweets3')
Out[32]:

Now, grey is kind of boring, and it is hard to make out the real density clusters.

In [33]:
export(tf.interpolate(count, cmap=viridis, how='eq_hist'), 'colored_total')
Out[33]:

Applications

Now, let's define which of the top-4 applications is the most popular at each point.

I actually started by defining colors. A strange thing to start with, but this way I can use the dictionary keys to filter apps later.

In [34]:
if background == "black":
    color_key = {'foursquare':'aqua', 'twitter for iphone':'white', 'instagram':'red', 'twitter for android':'grey'} #, 'o':'yellow' }
else:
    color_key = {'foursquare':'blue', 'twitter for iphone':'white', 'instagram':'red', 'twitter for android':'grey'} #  'o':'saddlebrown'}

Filter the data for the top-4 applications, just as with pandas.

In [46]:
appDf = d[d.application.isin(color_key.keys())]
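isin keeps only the rows whose value appears in the given collection, so the color_key keys double as the filter. A minimal sketch with a hypothetical miniature of the dataframe:

```python
import pandas as pd

# Hypothetical miniature of the tweets dataframe.
df = pd.DataFrame({'application': ['instagram', 'echofon',
                                   'foursquare', 'twitter for iphone']})
top4 = {'foursquare', 'twitter for iphone', 'instagram', 'twitter for android'}

# Keep only rows whose application is one of the top-4 keys.
filtered = df[df.application.isin(top4)]
print(len(filtered))  # 3 -- 'echofon' is dropped
```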

Now, let's turn application into a categorical type.

In [47]:
appDf = appDf.assign(application=appDf.application.astype('category'))
In [48]:
appDf.application.value_counts()
Out[48]:
twitter for iphone     20081109
instagram               5575630
twitter for android     4823926
foursquare              2506839
Name: application, dtype: int64
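Casting to 'category' replaces each repeated string with a small integer code plus a shared lookup table, which saves a lot of memory on a column with only four distinct values. A small sketch:

```python
import pandas as pd

s = pd.Series(['instagram', 'foursquare', 'instagram', 'instagram'])
cat = s.astype('category')

# Same values, but stored as integer codes plus a lookup table.
print(cat.cat.categories.tolist())  # ['foursquare', 'instagram']
print(cat.cat.codes.tolist())       # [1, 0, 1, 1]
```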

Now count by category

In [49]:
appCount = cvs.points(appDf, 'x', 'y', ds.count_cat('application'))
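ds.count_cat produces, for each pixel, one count per category (a 3-D aggregate), which tf.colorize then blends. Conceptually it is a pixel-by-category cross-tabulation; a sketch with hypothetical points already binned to pixel ids:

```python
import pandas as pd

# Hypothetical points, already binned to pixel ids.
df = pd.DataFrame({'pixel': [0, 0, 1, 1, 1],
                   'application': ['instagram', 'instagram', 'foursquare',
                                   'instagram', 'foursquare']})

# Per-pixel count for each category, like ds.count_cat('application').
counts = pd.crosstab(df['pixel'], df['application'])
print(counts.loc[1, 'foursquare'])  # 2
```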

And plot

In [50]:
export(tf.colorize(appCount, color_key, how='eq_hist'), 'colored_apps')
Out[50]:

Daytime

Now, let's visualise the time of day. Here I use the "hsv" colormap, as it is cyclic: I want the values for 00:05 and for 23:55 to get similar colors.

Also, I remove noise (points with fewer than 10 tweets), using the count aggregate we already computed.

In [55]:
threshold = 10
In [52]:
aggDaytime = cvs.points(d, 'x', 'y', agg=ds.mean('daytime'))
In [56]:
colormap = plt.get_cmap('hsv')
export(tf.interpolate(aggDaytime.where(count > threshold), cmap=colormap, how='eq_hist'), 'colored_daytime')
Out[56]:
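The where masking used above keeps only pixels with enough tweets and turns the rest into NaN, which datashader renders as transparent. The same idea in plain numpy, with hypothetical 2x2 aggregates:

```python
import numpy as np

# Hypothetical 2x2 aggregates: tweet counts and mean daytime per pixel.
counts = np.array([[3, 25],
                   [50, 7]])
mean_daytime = np.array([[41000., 42000.],
                         [43000., 44000.]])

threshold = 10
# Keep pixels with more than `threshold` tweets; mask the rest as NaN,
# like aggDaytime.where(count > threshold).
masked = np.where(counts > threshold, mean_daytime, np.nan)
print(masked)  # NaN in the two low-count cells
```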

And that is the end of the notebook.

In [ ]:
 

© Philipp Kats. Built using Pelican. Theme by Giulio Fidente on github.